We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.
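As an illustration of the single-reference setup the abstract critiques, here is a minimal sketch scoring a generated dialogue response against one target response with sentence-level BLEU, one such machine-translation metric, via NLTK. The example strings are hypothetical, and this is only one of several MT metrics such a pipeline might use.

```python
# Minimal sketch: scoring a generated dialogue response against a single
# reference response with an MT metric (sentence-level BLEU via NLTK).
# The response strings below are illustrative placeholders.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "i am doing well thanks for asking".split()
generated = "i am fine thank you".split()

# Smoothing avoids zero scores when higher-order n-grams have no overlap,
# which is common for short dialogue responses.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], generated, smoothing_function=smooth)
print(f"BLEU: {score:.4f}")
```

Because valid dialogue responses are highly diverse, a reasonable generated reply can share almost no n-grams with the single reference, which is consistent with the weak correlations with human judgement reported above.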